library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(here)
## here() starts at /Users/racquellemangahas/Desktop/stat547_class/project/group_13
library(ggplot2)
library(tidyr)

Task 1: Choosing a dataset

We found the dataset at: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016/data

Task 2: Project Proposal & EDA

2.1: Introduce and describe your dataset

This dataset was compiled from four other datasets, linked by time and place, to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum. The motivation behind it is suicide prevention. It has 12 columns: country, year, sex, age group, count of suicides, population, suicide rate (per 100k population), country-year composite key, HDI for year, gdp_for_year, gdp_per_capita, and generation (based on age grouping average).

The references for this study are:

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

2.2: Load your dataset

# The file is space-delimited despite its .csv extension, hence sep = " "
suiciderates <- read.table("suiciderates.csv", sep = " ")
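
As a sanity check on the `sep = " "` argument, the sketch below round-trips a small made-up data frame through a space-delimited file. It assumes the file was written with `write.table()` defaults, which is only a guess about how `suiciderates.csv` was produced; the `toy` object and the temporary path are hypothetical.

```r
# Toy data standing in for the real file (values are made up)
toy <- data.frame(
  country = c("Sri Lanka", "Canada"),
  year = c(1985L, 1986L),
  suicides_no = c(3L, 10L)
)

# write.table() defaults to sep = " " and quotes character values,
# so "Sri Lanka" survives even though it contains a space
path <- tempfile(fileext = ".csv")
write.table(toy, path)

# The header row has one fewer field than the data rows (no row-name
# column), so read.table() detects the header automatically
roundtrip <- read.table(path, sep = " ")
str(roundtrip)
```

If the real file had been comma-separated instead, `read.csv()` would be the more usual choice.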

Peek at dataset:

DT::datatable(suiciderates)

2.3: Explore your dataset

Exploratory Data Analysis of ‘suiciderates’

How many rows?

nrow(suiciderates)
## [1] 27820

How many columns?

ncol(suiciderates)
## [1] 12

Summary of suiciderates dataset:

summary(suiciderates)
##         country           year          sex                 age      
##  Austria    :  382   Min.   :1985   female:13910   15-24 years:4642  
##  Iceland    :  382   1st Qu.:1995   male  :13910   25-34 years:4642  
##  Mauritius  :  382   Median :2002                  35-54 years:4642  
##  Netherlands:  382   Mean   :2001                  5-14 years :4610  
##  Argentina  :  372   3rd Qu.:2008                  55-74 years:4642  
##  Belgium    :  372   Max.   :2016                  75+ years  :4642  
##  (Other)    :25548                                                   
##   suicides_no        population       suicides.100k.pop
##  Min.   :    0.0   Min.   :     278   Min.   :  0.00   
##  1st Qu.:    3.0   1st Qu.:   97498   1st Qu.:  0.92   
##  Median :   25.0   Median :  430150   Median :  5.99   
##  Mean   :  242.6   Mean   : 1844794   Mean   : 12.82   
##  3rd Qu.:  131.0   3rd Qu.: 1486143   3rd Qu.: 16.62   
##  Max.   :22338.0   Max.   :43805214   Max.   :224.97   
##                                                        
##       country.year    HDI.for.year   gdp_for_year....   
##  Albania1987:   12   Min.   :0.483   Min.   :4.692e+07  
##  Albania1988:   12   1st Qu.:0.713   1st Qu.:8.985e+09  
##  Albania1989:   12   Median :0.779   Median :4.811e+10  
##  Albania1992:   12   Mean   :0.777   Mean   :4.456e+11  
##  Albania1993:   12   3rd Qu.:0.855   3rd Qu.:2.602e+11  
##  Albania1994:   12   Max.   :0.944   Max.   :1.812e+13  
##  (Other)    :27748   NA's   :19456                      
##  gdp_per_capita....           generation  
##  Min.   :   251     Boomers        :4990  
##  1st Qu.:  3447     G.I. Generation:2744  
##  Median :  9372     Generation X   :6408  
##  Mean   : 16866     Generation Z   :1470  
##  3rd Qu.: 24874     Millenials     :5844  
##  Max.   :126352     Silent         :6364  
## 

Figuring out NAs in ‘suiciderates’ dataset:

Out of the entire dataset (27820 observations of 12 variables), what proportion of cells are NAs?

sum(is.na(suiciderates)) / (nrow(suiciderates) * ncol(suiciderates))
## [1] 0.05827942

For the Human Development Index (HDI) for year column, what proportion are NAs?

sum(is.na(suiciderates$HDI.for.year)) / nrow(suiciderates)
## [1] 0.699353

About 5.83% of all cells in the dataset are NAs, and the variable ‘HDI for year’ alone is about 70% NAs. We have therefore decided to exclude that variable from our analyses: with so much data missing, it cannot be reliably factored in when looking at suicide rates.
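
To see where missing values are concentrated, a per-column breakdown can help. The following is a minimal sketch on a small made-up data frame (the `toy` object and its values are illustrative assumptions, not the real data): `colMeans(is.na(...))` gives the proportion of NAs in each column, and `mean(is.na(...))` gives the proportion over all cells.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  suicides_no = c(3, 10, 0, 7),
  HDI.for.year = c(NA, 0.7, NA, NA)
)

# Proportion of NAs in each column
na_by_column <- colMeans(is.na(toy))
na_by_column

# Proportion of NAs over all cells (rows x columns)
na_overall <- mean(is.na(toy))
na_overall
```

Applied to `suiciderates`, the same two calls would show at a glance whether any column other than ‘HDI for year’ has missing values.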

Removing NAs and creating refined ‘suicideratesnew’ dataset:

Next, we drop ‘HDI for year’ and keep all other variables.

suicideratesnew <- suiciderates %>% 
  select(-HDI.for.year)
DT::datatable(suicideratesnew)

We now check what proportion of NAs remains in this dataset:

sum(is.na(suicideratesnew)) / (nrow(suicideratesnew) * ncol(suicideratesnew))
## [1] 0

No NAs remain in the new dataset, confirming that ‘HDI for year’ contained all of them.

Exploratory Data Analysis of ‘suicideratesnew’:

How many rows?

nrow(suicideratesnew)
## [1] 27820

How many columns?

ncol(suicideratesnew)
## [1] 11

Summary of suicideratesnew dataset:

summary(suicideratesnew)
##         country           year          sex                 age      
##  Austria    :  382   Min.   :1985   female:13910   15-24 years:4642  
##  Iceland    :  382   1st Qu.:1995   male  :13910   25-34 years:4642  
##  Mauritius  :  382   Median :2002                  35-54 years:4642  
##  Netherlands:  382   Mean   :2001                  5-14 years :4610  
##  Argentina  :  372   3rd Qu.:2008                  55-74 years:4642  
##  Belgium    :  372   Max.   :2016                  75+ years  :4642  
##  (Other)    :25548                                                   
##   suicides_no        population       suicides.100k.pop
##  Min.   :    0.0   Min.   :     278   Min.   :  0.00   
##  1st Qu.:    3.0   1st Qu.:   97498   1st Qu.:  0.92   
##  Median :   25.0   Median :  430150   Median :  5.99   
##  Mean   :  242.6   Mean   : 1844794   Mean   : 12.82   
##  3rd Qu.:  131.0   3rd Qu.: 1486143   3rd Qu.: 16.62   
##  Max.   :22338.0   Max.   :43805214   Max.   :224.97   
##                                                        
##       country.year   gdp_for_year....    gdp_per_capita....
##  Albania1987:   12   Min.   :4.692e+07   Min.   :   251    
##  Albania1988:   12   1st Qu.:8.985e+09   1st Qu.:  3447    
##  Albania1989:   12   Median :4.811e+10   Median :  9372    
##  Albania1992:   12   Mean   :4.456e+11   Mean   : 16866    
##  Albania1993:   12   3rd Qu.:2.602e+11   3rd Qu.: 24874    
##  Albania1994:   12   Max.   :1.812e+13   Max.   :126352    
##  (Other)    :27748                                         
##            generation  
##  Boomers        :4990  
##  G.I. Generation:2744  
##  Generation X   :6408  
##  Generation Z   :1470  
##  Millenials     :5844  
##  Silent         :6364  
## 

Plots

In this first plot, we look at how the mean number of suicides differs between generations, globally, between 1985 and 2016.

gen_suicides <- suicideratesnew %>% 
  group_by(generation) %>% 
  summarise("mean_suicides"=mean(suicides_no)) 
DT::datatable(gen_suicides)
gen_suicides %>% 
  ggplot() +
  geom_col(aes(x=fct_reorder(generation, mean_suicides),y=mean_suicides, fill=generation)) +
  xlab("Generation") +
  ylab("Mean # of suicides") +
  theme_minimal() +
  coord_flip() + 
  ggtitle("Average number of suicides globally across generations (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
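
Mean counts depend on how large each group's population is, so a population-adjusted rate may be a useful complement to the plot above. The sketch below is one possible way to compute it, assuming columns named `generation`, `suicides_no` and `population` as in `suicideratesnew`; the `toy` data and the `rate_per_100k` name are made up for illustration.

```r
library(dplyr)

# Toy rows standing in for country-year-sex-age records (values are made up)
toy <- data.frame(
  generation = c("Boomers", "Boomers", "Silent", "Silent"),
  suicides_no = c(10, 30, 5, 15),
  population = c(100000, 300000, 20000, 80000)
)

# Pooled rate per 100k: total suicides divided by total population
gen_rates <- toy %>%
  group_by(generation) %>%
  summarise(rate_per_100k = sum(suicides_no) / sum(population) * 1e5)

gen_rates
```

Pooling counts and populations before dividing weights each record by its population, which differs from averaging the per-row rates.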

In the second plot, we look at how the total number of suicides in Canada has changed over the years, to see whether there is a trend.

canada_suicides <- suicideratesnew %>% 
  filter(country== 'Canada') %>% 
  group_by(year) %>% 
  summarise("sum_suicides"=sum(suicides_no))
DT::datatable(canada_suicides)
canada_suicides %>% 
  ggplot() +
  geom_line(aes(x=year, y=sum_suicides)) +
  xlab("Year") +
  ylab("Sum of suicides") +
  theme_minimal() +
  ggtitle("Number of suicides in Canada (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))

Lastly, we look at the distribution of the number of suicides by sex across the entire dataset.

suicideratesnew %>% 
  ggplot() +
  geom_violin(aes(x=sex, y= suicides_no, fill=sex)) +
  xlab("Sex") +
  ylab("Number of suicides") +
  theme_minimal() +
  ggtitle("Distribution of suicides between sexes, globally (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
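
The summary of `suicides_no` shown earlier (median 25, maximum 22338) indicates a strongly right-skewed variable, so a numeric summary by sex may be easier to read than the violin shapes alone. This is a minimal sketch on made-up values; `toy` is a hypothetical stand-in for `suicideratesnew`.

```r
# Toy counts by sex (values are made up)
toy <- data.frame(
  sex = c("female", "female", "female", "male", "male", "male"),
  suicides_no = c(0, 4, 20, 2, 30, 400)
)

# Median and mean per sex; a large gap between them signals skew
medians <- tapply(toy$suicides_no, toy$sex, median)
means <- tapply(toy$suicides_no, toy$sex, mean)

rbind(median = medians, mean = means)
```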

Research question & plan of action

Research Question

Between 1985 and 2016, how did suicide rates differ between sexes and generations, and are they significantly correlated with each country's GDP per capita?

How?

Our research question focuses on how suicide rates differ between sexes and generations. We will first check whether there is a relationship between the variables of interest; if there is, we will perform a linear regression analysis and plot those variables with a regression line.
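
The planned regression step could look roughly like the following. This is a sketch under stated assumptions rather than the final analysis: the `toy` data frame, its values, and the simplified column names `gdp_per_capita` and `suicides_per_100k` are made up for illustration.

```r
# Made-up values, for illustration only
toy <- data.frame(
  gdp_per_capita = c(1000, 5000, 10000, 20000, 40000),
  suicides_per_100k = c(8, 10, 11, 13, 12)
)

# Check for a linear association first
assoc <- cor.test(toy$gdp_per_capita, toy$suicides_per_100k)
assoc$estimate

# Fit a simple linear regression of suicide rate on GDP per capita
fit <- lm(suicides_per_100k ~ gdp_per_capita, data = toy)
summary(fit)$coefficients
```

With the real data, the same formula could be fitted on `suicideratesnew` using its actual column names, and `geom_smooth(method = "lm")` in ggplot2 can draw the fitted line over a scatter plot.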